Abstract

We describe a technique for finding pixelwise correspondences between two images by using models of objects 
of the same class to guide the search. The object models are "learned" from example images (also called 
prototypes) of an object class. The models consist of a linear combination of prototypes. The flow fields giving 
pixelwise correspondences between a base prototype and each of the other prototypes must be given. A novel 
image of an object of the same class is matched to a model by minimizing an error between the novel image and 
the current guess for the closest model image. Currently, the algorithm applies to line drawings of objects. An 
extension to real grey level images is discussed. 

1 Introduction

The problem of image correspondence is basic to 
computer vision and arises in a number of vision appli- 
cations such as stereo disparity, object recognition and 
motion estimation. General solutions such as optical 
flow techniques for computing the pixelwise correspon- 
dences between two images only work when the dif- 
ferences between the two images are relatively small. 
When the two images have large differences such as 
large rotations or changes in shape, then general meth- 
ods for computing correspondences break down. For 
many applications, prior knowledge is available about 
the contents of the images for which the correspon- 
dence is being computed. This knowledge may be 
exploited in order to create a more robust correspon- 
dence algorithm. This is the approach discussed in 
this paper. We describe an algorithm for model-based 
matching which uses a simple model of a class of ob- 
jects to find the correspondence between a novel view 
of an object of the same class and a standard "proto- 
typical" view. Instead of using 3D models for objects, 
we build models from 2D example views of the objects. 
Our technique requires that the pixelwise correspon- 
dences between each example view and the standard 
example view be given by the user (presumably by 
semiautomatic techniques) in the training stage. Cur- 
rently we are concerned with matching line drawings 
although straightforward extensions should allow the 
algorithm to be used with real images. Hence, this 
paper focuses on models of the shape of objects which 
do not take into account their textures. 

2 Related work 

Other researchers have studied techniques for con- 
straining the search for correspondences by assuming 
a model for the form of valid flow fields. For example, 
Cootes and Taylor ([5, 6, 7]) proposed Active Shape
Models (ASMs), which are similar to the approach we
are taking. An ASM is built by first manually identify- 
ing a number of control points on a real image of an ob- 
ject. After the same control points are identified on a 
number of different images of the same object, a prin- 
cipal components analysis is done on the matrix con- 
sisting of vectors of control points. This yields a set of 
eigenvectors which describe the directions (in control 
point space) of greatest variation along which the con- 
trol points change. An ASM is then the linear combi- 
nation of eigenvectors plus parameters for translation, 
rotation and scaling. An ASM is matched to a novel 
image of the object by an algorithm that searches a 
region in the novel image around the current position 
of each control point to find a position of better fit for 
each control point and then updates the parameters 
of the ASM accordingly. Two of the main differences
between their approach and ours are the fitting algo-
rithm (ours is gradient based) and our use of a dense
pixelwise flow field as opposed to their sparse vector
of control points. Also, Cootes and
Taylor match shape models (which are basically line 
drawings) to real images whereas we match line draw- 
ings to line drawings and also describe a method for 
matching real image models to real images. 

Another group of researchers, Bergen, Anandan,
Hanna and Hingorani [1], have described a framework
for grey-level motion estimation. Their work is based 
on defining an error function which must be minimized 
to find the optimal flow field between two images. The 
error function they use is the sum of squared differ- 
ences between one image and a warping of the other 
image according to the current estimate of the flow 
field. Bergen et al. constrain the flow field to ad- 
here to some preselected form or model. The error 
is then minimized with respect to the parameters of 
the model by the Gauss-Newton minimization algo- 
rithm. The particular model used to constrain the 
flow can be selected according to the particular appli- 
cation. The ones discussed in Bergen et al. are rather 
general: affine flow, planar surface flow, rigid body 
motion and general optical flow. The main difference 
between their work and ours is the type of model used. 
Our models are learned from examples and are specific 
to a particular object class. 

The main motivation for our work is the linear class 
concept of Poggio and Vetter [11, 9] that justifies mod- 
eling an object in terms of a linear combination of pro- 
totypes. Poggio and Vetter showed that linear trans- 
formations can be learned exactly from a small set of 
examples in the case of linear object classes. Further- 
more, many object transformations such as 3D rota- 
tions of a rigid object and changing expression of a 
face can be approximated by linear transformations
that can be learned from a small number of examples.
The same motivation underlies the work of Beymer [2] 
who describes an alternative approach, also based on 
a linear combination of prototypes, to vectorize grey- 
level images. 

3 Model-based matching using prototypes

3.1 The model 

We would like the models used for model-based 
matching to be learned from examples as opposed to 
being hardwired. To learn a model, a number of exam- 
ples or prototypes of an object are given which show 
how the object can change. For example, to learn a 
model of a face with varying pose and facial expres- 
sion, several examples of the face at different poses 
and with different expressions would be given to the 
system. 

In addition to the prototype images, we require that 
pixelwise correspondences be given between one of the 
prototypes (usually the "average" prototype) which is 
chosen to be the base image and each of the other pro- 
totypes. In practice the correspondences are specified 
by the user during this "learning" stage in a semiau- 
tomatic way using special tools. 

Given the correspondences, each prototype can be 
"vectorized" - written as a vector of points. In prac- 
tice each prototype is represented as two matrices, one 
with the displacements in the x direction from each 
point in the base image to the corresponding point in 
the prototype and one with the y displacements. We 
define a model in this framework to be a linear combi- 
nation of vectorized prototypes or equivalently a linear 
combination of example flow fields (see also [10, 3]). 



To write the models mathematically, we must first 
introduce some notation. Let I_0 be the base prototype
image to which all the correspondences refer. Let
N be the total number of prototypes. Let Dxi be the
matrix of displacements in the x direction mapping
the coordinates of base image I_0 to the corresponding
coordinates of prototype I_i. Similarly, let Dyi be the
matrix of y displacements. Together, Dxi and Dyi 
make up a flow field. The model images consist of all 
images whose flow field is a linear combination of the 
prototype flow fields plus an affine transformation. In 
symbols, 

$$Dx' = \sum_{i=1}^{N-1} c_i Dx_i + p_0 X + p_1 Y + p_2$$

$$Dy' = \sum_{i=1}^{N-1} c_i Dy_i + p_3 X + p_4 Y + p_5$$

The Dx' and Dy' matrices are the flow field de-
scribing model image I'. Each row of the constant
matrix X is (-w/2, -w/2 + 1, ..., -1, 0, 1, ..., w/2 - 1, w/2),
where w is the width of the prototype images. Similarly,
each column of the constant matrix Y is
(-h/2, -h/2 + 1, ..., -1, 0, 1, ..., h/2 - 1, h/2)^T, where
h is the height of the images.

These equations describe the flow fields for the 
model images. To actually get the grey level repre-
sentation of I', it is necessary to warp base image I_0
according to Dx' and Dy' and thereby render the ma-
trices Dx' and Dy' as a black and white image. If the
warp function simply moves pixels in the base image
according to the flow field (without doing any blurring
or hole filling) then a model image can be written

$$I'(x + Dx'(x,y),\; y + Dy'(x,y)) = I_0(x,y).$$
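
As an illustration only, the model flow field and the simplest warp can be sketched in NumPy as follows. The function names, the array layout (prototype flows stacked along the first axis) and the nearest-pixel rounding are assumptions made for this sketch, not details of the original implementation.

```python
import numpy as np

def model_flow(c, p, Dx, Dy):
    """Flow field of a model image: linear combination of prototype flows plus affine term.

    Dx, Dy: arrays of shape (N-1, h, w) holding the prototype flow fields.
    c: the N-1 linear coefficients; p: the six affine parameters p_0..p_5.
    """
    h, w = Dx.shape[1:]
    # Centered coordinate grids: X is constant down each column, Y along each row.
    # Assumption: the centered range is truncated so that it fits the array size.
    xs = np.arange(w) - w // 2
    ys = np.arange(h) - h // 2
    X, Y = np.meshgrid(xs, ys)
    Dx_model = np.tensordot(c, Dx, axes=1) + p[0] * X + p[1] * Y + p[2]
    Dy_model = np.tensordot(c, Dy, axes=1) + p[3] * X + p[4] * Y + p[5]
    return Dx_model, Dy_model

def forward_warp(base, Dx_model, Dy_model, background=0.0):
    """Move every base pixel by its flow vector (nearest pixel, no blurring or hole filling)."""
    h, w = base.shape
    out = np.full((h, w), background, dtype=float)
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.rint(xs + Dx_model).astype(int)
    yt = np.rint(ys + Dy_model).astype(int)
    inside = (xt >= 0) & (xt < w) & (yt >= 0) & (yt < h)  # drop pixels that leave the image
    out[yt[inside], xt[inside]] = base[inside]
    return out
```

With all coefficients zero the flow vanishes and the warp returns the base image itself, which is why zero is a natural starting point for the parameters.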

To obtain prototype line drawings and the associ- 
ated correspondences in practice, a drawing program 
is used. A model of a new object is made by first creat- 
ing a line drawing of the base image. The base image is 
usually the approximate average image in terms of the 
various object transformations one wants to represent. 
Next, new examples of the object are drawn by chang- 
ing the lines and curves of the base prototype. The 
pixelwise correspondences between the base prototype 
and each additional prototype can then be computed 
automatically since the equations describing the lines 
and curves in each prototype are known. A typical 
example base of prototype images is shown in figure 
1. 

3.2 Matching novel images 

Now that the prototypes have been defined, we 
want to use them to find the pixelwise correspondence 
between the base prototype and a novel image that is 
in the same object class as the prototypes. The gen- 
eral strategy for matching the novel image will be to 
define an error between the novel image and the cur- 
rent guess for the closest model image after rendering 
it and then try to minimize this error with respect to 
the linear coefficients c_i and the affine parameters p_i.



Following this strategy, we define the sum of squared
differences error

$$E(c, p) = \frac{1}{2} \sum_{x,y} \left[ I^{novel}(\hat{x}, \hat{y}) - I^{model}(\hat{x}, \hat{y}) \right]^2$$

where

$$\hat{x} = x + \sum_{i=1}^{N-1} c_i Dx_i(x,y) + p_0 x + p_1 y + p_2,$$

$$\hat{y} = y + \sum_{i=1}^{N-1} c_i Dy_i(x,y) + p_3 x + p_4 y + p_5,$$

the sum is over all pixels (x, y) in the images, I^novel is
the novel grey level image being matched and I^model
is the model grey level image. Assuming the simplest
warping function,

$$I^{model}(\hat{x}, \hat{y}) = I_0(x, y).$$

In this case, the error can be written

$$E(c, p) = \frac{1}{2} \sum_{x,y} \left[ I^{novel}(\hat{x}, \hat{y}) - I_0(x, y) \right]^2.$$

The sum of squared differences error depends on the 
model parameters and gives a measure of the distance 
between the novel image and the current guess for the 
model image. Minimizing the error yields the model 
image which best fits the novel image. 
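
A minimal NumPy sketch of this error under the simplest warping function is given below. It assumes pixel coordinates measured from the array origin, nearest-neighbour sampling of the novel image and clamping at the image border; the function name `ssd_error` is hypothetical.

```python
import numpy as np

def ssd_error(c, p, novel, base, Dx, Dy):
    """E(c, p) = 1/2 * sum over base pixels of [novel(x_hat, y_hat) - base(x, y)]^2.

    Dx, Dy: prototype flow fields of shape (N-1, h, w); c: N-1 coefficients;
    p: the six affine parameters p_0..p_5.
    """
    h, w = base.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    x_hat = xs + np.tensordot(c, Dx, axes=1) + p[0] * xs + p[1] * ys + p[2]
    y_hat = ys + np.tensordot(c, Dy, axes=1) + p[3] * xs + p[4] * ys + p[5]
    # Nearest-neighbour lookup with border clamping (an assumption of this sketch;
    # a gradient-based fit would normally interpolate so that E varies smoothly).
    xi = np.clip(np.rint(x_hat).astype(int), 0, w - 1)
    yi = np.clip(np.rint(y_hat).astype(int), 0, h - 1)
    residual = novel[yi, xi] - base
    return 0.5 * np.sum(residual ** 2)
```

For a novel image that is the base image shifted two pixels to the right, the error is zero at p_2 = 2 and positive at p = 0.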

In order to minimize the error function, the 
Levenberg-Marquardt algorithm ([12]) is used (a sim- 
ilar use of Levenberg-Marquardt is described in [13]). 
This algorithm requires the derivative of the error with 
respect to each parameter. The necessary derivatives 
are as follows: 
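
The derivative expressions themselves are not legible in this copy. As a sketch, they can be re-derived by the chain rule from the error under the simplest warping function, writing (x̂, ŷ) for the warped position of base pixel (x, y) and r(x, y) for the residual. This derivation is ours and is not copied from the source.

```latex
\begin{aligned}
r(x,y) &= I^{novel}(\hat{x},\hat{y}) - I_0(x,y), \\
\frac{\partial E}{\partial c_i} &= \sum_{x,y} r(x,y)
  \left[ \frac{\partial I^{novel}}{\partial \hat{x}}\, Dx_i(x,y)
       + \frac{\partial I^{novel}}{\partial \hat{y}}\, Dy_i(x,y) \right], \\
\frac{\partial E}{\partial p_0} &= \sum_{x,y} r(x,y)\, \frac{\partial I^{novel}}{\partial \hat{x}}\, x, \quad
\frac{\partial E}{\partial p_1} = \sum_{x,y} r(x,y)\, \frac{\partial I^{novel}}{\partial \hat{x}}\, y, \quad
\frac{\partial E}{\partial p_2} = \sum_{x,y} r(x,y)\, \frac{\partial I^{novel}}{\partial \hat{x}}, \\
\frac{\partial E}{\partial p_3} &= \sum_{x,y} r(x,y)\, \frac{\partial I^{novel}}{\partial \hat{y}}\, x, \quad
\frac{\partial E}{\partial p_4} = \sum_{x,y} r(x,y)\, \frac{\partial I^{novel}}{\partial \hat{y}}\, y, \quad
\frac{\partial E}{\partial p_5} = \sum_{x,y} r(x,y)\, \frac{\partial I^{novel}}{\partial \hat{y}}.
\end{aligned}
```

Here the partial derivatives of I^novel are its image gradients evaluated at (x̂, ŷ).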
Figure 1: A typical example base of prototype line drawings.

Given these derivatives, the Levenberg-Marquardt
algorithm can be used straightforwardly to find the 
optimal c and p. Notice that the algorithm is a se- 
quence of vectorization and rendering (through warp- 
ing) steps. 

3.3 Improving performance 

Implementing the minimization described in the 
previous section using line drawings as prototypes 
does not work well when the initial model parame- 
ters are far from the optimal ones. There are a couple 
of standard techniques we can use that improve the 
performance of the matching significantly. 

The first improvement is to simply blur the line 
drawings. Since only the black pixels are important 
in a line drawing, a blurring algorithm is used which 
only blurs the black pixels onto the white background. 
Using blurred line drawings makes the minimization 
more robust in the sense that the initial parameters 
can be much further away from the optimal ones for 
the minimization to succeed. 
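
The blurring algorithm is not specified further. One possible reading, given purely as an assumed sketch, is to spread the "ink" of the line pixels with a small box filter while keeping the original line pixels fully black. The convention of 0.0 for black and 1.0 for white, the 3x3 filter and the name `blur_lines` are all assumptions.

```python
import numpy as np

def blur_lines(image, passes=2):
    """Spread black line pixels (0.0) onto a white background (1.0)."""
    ink = 1.0 - image                      # 1 where there is a line, 0 on the background
    spread = ink.copy()
    h, w = ink.shape
    for _ in range(passes):
        padded = np.pad(spread, 1, mode="edge")
        # 3x3 box filter built from the nine shifted views of the padded array.
        spread = sum(
            padded[dy:dy + h, dx:dx + w]
            for dy in range(3) for dx in range(3)
        ) / 9.0
    # Keep the original line pixels fully black; only the background is changed.
    spread = np.maximum(spread, ink)
    return 1.0 - spread
```

More passes widen the grey halo around each line, which is what lets the minimization start further from the optimum.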

The second improvement is to use a coarse-to-fine 
approach. This is a standard technique in computer 
vision ([4]). The idea is to create a pyramid of im- 
ages with each higher level of the pyramid containing 
an image that is one fourth the size of the one below. 
The flow fields must also be subsampled, and all x 
and y displacements must be divided by 2. Levenberg- 
Marquardt is used to fit the model parameters start- 
ing at the coarsest level, and then these parameters 
are used as the starting point at the next level. The
constant affine parameters (p_2 and p_5) must be mul-
tiplied by 2 as they are passed down the pyramid to
account for the increased size of the images.
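
The pyramid construction can be sketched as follows. Averaging 2x2 blocks is one assumed way of subsampling, and all function names are hypothetical.

```python
import numpy as np

def downsample(a):
    """Halve width and height by averaging 2x2 blocks (an odd last row or column is cropped)."""
    h, w = a.shape[0] // 2 * 2, a.shape[1] // 2 * 2
    a = a[:h, :w]
    return 0.25 * (a[0::2, 0::2] + a[1::2, 0::2] + a[0::2, 1::2] + a[1::2, 1::2])

def image_pyramid(image, levels):
    """Return [finest, ..., coarsest]; each level has one fourth the pixels of the previous one."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

def flow_pyramid(D, levels):
    """Like image_pyramid, but the displacements are also divided by 2 at each coarser level."""
    pyramid = [np.asarray(D, dtype=float)]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]) / 2.0)
    return pyramid

def to_finer_level(p):
    """Double the constant affine terms p_2 and p_5 when moving to the next finer level."""
    p = np.array(p, dtype=float)
    p[[2, 5]] *= 2.0
    return p
```

Only p_2 and p_5 are rescaled because they are measured in pixels; the other four affine parameters are dimensionless ratios and carry over unchanged.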

The coarse-to-fine approach also significantly im- 
proves the robustness of the matching. When com- 
bined with blurring, the matching algorithm works 
well for a large range of settings of the initial param- 
eters. 

A stochastic gradient minimization algorithm (de- 
scribed in [14]) has also been tried in place of 
Levenberg-Marquardt. It was found to be much faster 
(around 25 times) and more robust in that it got 
caught in local minima less frequently. The results 
reported here are with the Levenberg-Marquardt algo- 
rithm because the stochastic gradient algorithm was 
implemented after the first draft of this paper. 

3.4 Pseudo code 

The following pseudo code describes the matching 
algorithm. 

1. Load novel image, I^novel

2. Load base prototype, I_0, and flow fields for the
other prototypes, Dxi and Dyi

3. Create image pyramids for I^novel and I_0 and for
each Dxi and Dyi

4. Blur all images in novel image pyramid

5. Initialize parameters c and p (typically set to zero)

For each level in the pyramid (from coarsest to finest)

    6. Estimate the parameters c and p using
    Levenberg-Marquardt

    When computing the error in Levenberg-
    Marquardt, the model image is created by
    warping I_0 according to the current linear
    combination of prototype flow fields plus
    affine parameters and then the resulting
    model image is blurred.

    7. Multiply the constant affine parameters p_2 and
    p_5 by 2

    8. Go to next level

9. Output the parameters
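
The control flow of steps 5 to 9 can be sketched as below. The per-level fit is passed in as a hypothetical callable `fit_level`, which stands in for the Levenberg-Marquardt step and is not implemented here. The sketch also reads step 7 as applying only when moving on to a finer level, which is our interpretation of the pseudo code.

```python
import numpy as np

def coarse_to_fine(levels, fit_level, n_prototypes):
    """Run a per-level fit from the coarsest to the finest pyramid level.

    levels: list of per-level data, ordered finest first.
    fit_level: hypothetical callable (c, p, level_data) -> (c, p) standing in
        for the Levenberg-Marquardt estimate of step 6.
    """
    c = np.zeros(n_prototypes - 1)   # step 5: all zeros, i.e. start from the base prototype
    p = np.zeros(6)
    for depth, level_data in enumerate(reversed(levels)):
        if depth > 0:
            p[[2, 5]] *= 2.0         # step 7: rescale translations for the larger image
        c, p = fit_level(c, p, level_data)   # step 6
    return c, p                      # step 9
```

Any optimizer with this calling convention can be plugged in, so the same driver would serve for a stochastic gradient fit as well.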

3.5 Results 

Some preliminary tests have been done using our
approach to model-based matching. In one such test,
the prototype images in figure 1 were used to create
a model of simple cartoon faces. The pixelwise corre-
spondences between each prototype and the base pro-
totype were obtained using the output of the drawing
program on which the images were generated. The
base prototype is the face in the upper left corner of
figure 1.

Figure 2: Results of matching novel images using the prototypes in figure 1. The novel images are in the top
row and the model images which were estimated are in the bottom row.

Novel images which were similar to those in the ex- 
ample base were created by hand. These images were 
drawn so that they were roughly normalized for trans- 
lation, scale and rotation. Figure 2 shows the results 
of fitting the model to the novel images. The top row 
of images are the novel images and the bottom row are 
the closest model images as estimated by the match- 
ing algorithm described above. The model parameters 
were all initialized to zero, which means the base pro- 
totype was used as the starting point for the matching 
algorithm. As the figure shows, the algorithm did a 
good job of finding a model image which matched well 
with each novel image. The lines in the model images 
are thicker due to a small amount of blurring that is 
done after warping in order to fill in "holes" left by 
warping. All model images are generated from their 
respective flow fields by warping the base image. 

4 Extensions 

4.1 A general hierarchical componentwise 
approach 

An affine transformation is included in the model 
because it allows for the novel image being matched 
to have moderate changes in scale, rotation and trans- 
lation from the model prototypes. In other words, the 
affine parameters provide some extra tolerance in the 
model. Of course, the affine parameters are global in 
the sense that they scale the whole image or rotate 
the whole image as opposed to affecting only a piece 
or a single feature of the image. This fact exposes one 
of the problems with the approach just described. It 
is brittle to translations, rotations or scaling of only a 
single feature in the image if this local variation is not 
accounted for by some of the prototypes. This is more 
of a problem for matching novel line drawings that a 
user has complete freedom in creating than with real 
images which are constrained by the physical world. 

One obvious solution to this problem is to use a 
componentwise approach in which images are treated
as being composed of several different components, say
eyes, mouth and nose. Each component would have 
its own model using the same formulation as in the 
previous section. In other words, each component is 
specified by a number of prototypes along with the 
pixelwise correspondences for each prototype. These 
components are then combined to form a complete im- 
age, say of a full face, by specifying where each compo- 
nent can be located. The location information is again 
specified using a number of prototypes for the whole 
image. These image prototypes would simply consist 
of x,y locations for each component. A number of 
image prototypes would be needed to show how each 
component could change location relative to the other 
components. The new componentwise model would 
be a linear combination of location vectors as well as 
a linear combination of individual component proto- 
types. 

We are extending this componentwise idea towards 
a potentially powerful hierarchical framework to allow 
more complicated images (with possibly multiple ob- 
jects). The idea is to build components from a linear 
combination of component prototypes and then build 
simple objects from a linear combination of positions 
of components and then build more complicated ob- 
jects from a linear combination of positions of simple 
objects and so on. 

4.2 Using real images 

Another ongoing extension to this work is to apply 
the matching algorithm to real grey level and color 
images as opposed to black and white line drawings. 

In this case, in addition to modeling the shape of 
objects, we also model the texture of objects. We 
model texture analogously to the way we modeled 
shape - as a linear combination of the grey level val- 
ues (texture) of the prototype images (see also [2], 
for an alternative approach to the same problem). A 
rather general justification of models of shape and tex- 
ture consisting of linear combinations of prototypical 
shapes and textures is the following. Under weak as- 
sumptions, one can prove that if any network can learn 
to synthesize shape or texture from examples then the 
desired shape or texture must be well approximated 
by a linear combination of the examples (see [3, 8]). 

Let {I_j} be the set of prototype images, where I_0
is the base image. Define DI_j, the image of intensity
differences between I_j and I_0, as

$$DI_j(x,y) = I_j(x + Dx_j(x,y),\; y + Dy_j(x,y)) - I_0(x,y).$$

For any (x, y) in the base image, the corresponding
model point is

$$I^{model}(\hat{x}, \hat{y}) = I_0(x,y) + \sum_{j=1}^{N-1} b_j DI_j(x,y)$$

where

$$\hat{x} = x + \sum_{i=1}^{N-1} c_i Dx_i(x,y) + p_0 x + p_1 y + p_2,$$

$$\hat{y} = y + \sum_{i=1}^{N-1} c_i Dy_i(x,y) + p_3 x + p_4 y + p_5.$$

In other words, the new position of the pixel at lo-
cation (x,y) in the base image is determined by a
linear combination of prototype positions (given by 
Dxi(x,y) and Dyi(x, y)), and the new grey level value 
of the pixel is determined by a linear combination of 
prototype grey level values for that pixel. The two lin- 
ear combinations, for shape and texture respectively, 
use the same set of prototype images but two different 
sets of coefficients. 

To match a novel grey level image, we can still use 
Levenberg-Marquardt. The minimization is now with 
respect to the vector of grey level coefficients b as well 
as to c and p. 
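
The texture part of this model can be sketched in NumPy as follows. The function names are hypothetical, and nearest-neighbour sampling with border clamping is an assumption of the sketch.

```python
import numpy as np

def intensity_difference(I_j, I_0, Dx_j, Dy_j):
    """DI_j(x, y) = I_j(x + Dx_j(x, y), y + Dy_j(x, y)) - I_0(x, y)."""
    h, w = I_0.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbour sampling with border clamping (an assumption).
    xt = np.clip(np.rint(xs + Dx_j).astype(int), 0, w - 1)
    yt = np.clip(np.rint(ys + Dy_j).astype(int), 0, h - 1)
    return I_j[yt, xt] - I_0

def model_texture(I_0, DI, b):
    """Grey levels of the model, indexed by base pixel: I_0 + sum_j b_j * DI_j.

    DI: array of shape (N-1, h, w) stacking the difference images; b: N-1 coefficients.
    """
    return I_0 + np.tensordot(b, DI, axes=1)
```

The resulting grey levels are still indexed by base pixel; they are placed at (x̂, ŷ) by the shape part of the model, which uses the separate coefficients c and p.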

5 Applications 
5.1 Image analysis 

One problem that model-based matching can be ap- 
plied to is the problem of image analysis. By image 
analysis we mean the problem of determining certain 
parameters describing an image such as the pose or ex- 
pression parameters of an image of a face for example. 
Our approach to image analysis is to learn a mapping 
from images to their corresponding parameters (see 
[3]). The representation used for the images is critical 
in this approach. For example, trying to find a map- 
ping from the raw grey level matrix of an image to its 
associated parameters would not result in a mapping 
which generalized to new images. This is because the 
grey level values of an image do not change smoothly 
as the objects in the image change smoothly. Instead 
of using the grey level representation, Beymer et al. 
find the pixelwise correspondences for each example 
image and use the vector of labelled points for each 
image as the image representation. They call the vec- 
tor of labelled points the "vectorized" representation 
of an image. Thus to analyze a new image, it must first 
be converted into the vectorized representation. To do 
this we can use the model-based matching approach 
previously described instead of other techniques such 
as optical flow. Thus, our approach to image analysis 
is to first define a model as described in section 3.1
from a set of prototype images and their flow fields.
The analysis parameters (such as pose) are also given 
for each prototype. A mapping is then learned which 
maps the vectorized prototypes to their corresponding 
analysis parameters. A novel image is analyzed by first 
matching the linear combination of prototypes model 
to the image as described in section 3.2. After match- 
ing, the resulting correspondences are used to create 
the vectorized representation for the novel image. The 
parameters of the novel image are then calculated by 
applying the previously learned mapping to the vec- 
torized representation of the novel image as described 
in [3]. 
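
The form of the learned mapping is left to [3]. As a stand-in only, the sketch below fits a linear map with a bias term by least squares from vectorized prototypes to their analysis parameters. This is a simplification assumed for illustration, not the approximation network of [3], and the function names are hypothetical.

```python
import numpy as np

def learn_linear_mapping(vectors, params):
    """Least-squares fit of params ~ [vectors, 1] @ W (a linear stand-in for the learned mapping).

    vectors: (n_prototypes, d) vectorized prototypes; params: (n_prototypes, k).
    """
    A = np.hstack([vectors, np.ones((vectors.shape[0], 1))])
    W, *_ = np.linalg.lstsq(A, params, rcond=None)
    return W

def analyze(vector, W):
    """Apply the learned mapping to the vectorized representation of a novel image."""
    return np.append(vector, 1.0) @ W
```

With only a handful of prototypes and long vectorized representations the fit is underdetermined, in which case `lstsq` returns the minimum-norm solution.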

As described briefly in section 3.5, we have written 
a system for analyzing line drawings such as those in 
figure 1. The system learns to analyze sketches from 
a user who trains the system with prototype exam- 
ples. The system is first trained with prototypes of 
line drawings of an object along with the pixelwise cor- 
respondences. Given a set of prototypes, the system 
attempts to match a novel line drawing which is ap- 
proximately in the space of images spanned by the pro- 
totypes using the algorithm of section 3.4. The model 
parameters which are found by the matching can be 
used as the analysis parameters for the image. Alter- 
natively, the model parameters can be mapped by an 
approximation network to a possibly higher level set 
of analysis parameters (see [3]. Examples of the higher 
level parameters would be given with each prototype. 

5.2 Man-machine interface 

Image analysis can be used to build a general man- 
machine interface or a gesture recognition system ([3]). 
For example, if a model of a hand were built from ex- 
ample views of a hand then novel views of a hand could 
be analyzed to recover their position and orientation. 
These parameters could then be used as input to a 
computer to control things the same way a 3D mouse 
does. Another possibility for a man-machine interface
is analyzing facial expression and using it as input
to the computer.

Other potential applications for model-based 
matching are object recognition, very low bandwidth 
teleconferencing and virtual reality simulations. 

6 Discussion and conclusions 

We have described a robust algorithm for model- 
based matching. Using object models to guide the 
matching algorithm may be essential in cases where 
the differences between two images of an object are 
too great for a general correspondence algorithm to 
work well. The need for prior knowledge in the form 
of object models comes from the fact that optical flow 
is an underconstrained problem although other ways 
of adding constraints have of course been used (see for 
example [1, 13]). 

The linear combination of prototypes model that 
we described has several advantages. It is a simple 
learning-from-examples model that only requires 2D 
views as opposed to a 3D model. It also has a deep
motivation, since the linear combination of prototypes
model is intimately related to general properties of
a very broad class of synthesis networks of the type
described by [3]. A new model is fairly simple to create
since all that is required are a number of example
views of the object class and the pixelwise correspon- 
dences for each. Most importantly, the matching al- 
gorithm works well in practice. One problem with 
this approach is the need for the correspondences for 
each prototype. In general we expect that once a good 
vocabulary of models is created, new models will not 
need to be created very often. 

Acknowledgements: The authors would like to 